Introduction

In previous documents, I have:

  • defined LADs that change during the cell cycle
  • tried to link these to other data sets

This was done very inefficiently, unorganized and for the (three) cell types separately. I want to make this more generalized, structured and plot the resulting output in a more summarized fashion.

Method

Overall, this document will replace the documents linking the dynamic LADs to other data sets. The method will be the same, although I would like to design a different output figure to summarize all data much conveniently. The strategy is summarized below.

Analysis procedure

Various plots.

Set-up

Initially, I want to prepare the input and output. Load in the libraries and set other constants.

Set-up knitr

I always appreciate having the markdown figures as png / pdf, as plotted in the document. This is set below.

1) Load input

First, I will load the required input and prepare the data structure required.

3) LAD dynamics over the cell cycle

Plot the progression throughout the cell cycle.

## [1] "RPE"

Looks good. Note that the LADs can still decrease a lot after S. Of course, this result is now combined with the S-specific biology and thus unreliable.

for (cell in cells) {
  
  print(cell)
  
  # Combine replicates
  LADs.norm.combined <- LADs.norm.cell <-LADs.norm[[cell]]
  samples <- samples.df[[cell]]
  
  mcols(LADs.norm.combined) <- do.call(cbind, 
                                       tapply(samples$name,
                                              samples$phase,
                                              function(x) rowMeans(as(mcols(LADs.norm.cell)[, x], "data.frame"), 
                                                                   na.rm = T)))
  
  # Convert to df
  df <- as(mcols(LADs.norm.combined), "data.frame")
  
  # Add ID
  df$result <- LADs.norm.cell$G2_G1
  df$ID <- 1:nrow(df)
  
  # For each time point, determine the change per hour
  timepoints <- unique(samples$time)
  df.change <- cbind(df[, c("result", "ID")],
                     t(t(df[, 2:5] - df[, 1:4]) / (timepoints[2:5] - timepoints[1:4])))
  
  # Melt
  df.melt <- melt(df.change, id.vars = c("result", "ID"))
  df.melt$time <- as.numeric(gsub("h.*", "", gsub("t_", "", df.melt$variable)))
  #df.melt$time <- (df.melt$time + timepoints[match(df.melt$time, timepoints) - 1]) / 2
  
  # Plot
  plt <- ggplot(df.melt, aes(x = time, y = value, col = result, fill = result)) +
    stat_summary(fun.y = mean, geom = "line", size = 1.5) +
    stat_summary(fun.data = mean_sdl, geom = "ribbon", alpha = 0.25, col = NA,
                 fun.args = list(mult = 1)) +
    ggtitle("pA-DamID dynamics over time") +
    #facet_grid(. ~ result) +
    xlim(0, max(timepoints)) +
    xlab("time (h)") +
    ylab("change per hour") +
    scale_color_manual(values = brewer.pal(9, "Set1")[c(9, 1, 3)]) +
    scale_fill_manual(values = brewer.pal(9, "Set1")[c(9, 1, 3)]) +
    theme_bw() +
    theme(aspect.ratio = 1)
  
  plot(plt)
  
  # Also, make a different plot
  plt <- ggplot(df.melt, aes(x = variable, y = value, col = result, fill = result)) +
    geom_hline(yintercept = 0, col = "grey", linetype = "dashed") +
    geom_boxplot(outlier.shape = NA, col = "black") +
    xlab("time") +
    ylab("change per hour") +
    coord_cartesian(ylim = c(-0.42, 0.24)) +
    scale_x_discrete(labels = c("1h -> 3h", "3h -> 6h", "6h -> 10h", "10h -> 21h")) +
    scale_color_manual(values = brewer.pal(9, "Set1")[c(9, 1, 3)]) +
    scale_fill_manual(values = brewer.pal(9, "Set1")[c(9, 1, 3)]) +
    theme_bw() +
    theme(aspect.ratio = 1,
          axis.text.x = element_text(angle = 90, hjust = 1))
  
  plot(plt)
  
}
## [1] "RPE"

4) Gene expression

How do the dynamic LADs correlate with gene expression?

df.combined <- c()

for (cell in cells) {
  
  print(cell)
  
  # LAD status to genes
  LADs <- LADs.norm[[cell]]
  ovl <- findOverlaps(rnaseq, LADs, type = "within")
  
  # Convert to df
  df <- as(mcols(rnaseq), "data.frame")[, c("gene_id", "gene_name", cell)]
  names(df)[3] <- "expr"
  
  df$result <- factor("iLAD", levels = c(levels(LADs$G2_G1), "iLAD"))
  df$result[queryHits(ovl)] <- LADs$G2_G1[subjectHits(ovl)]
  
  # Complete cases
  df <- df[complete.cases(df), ]
  
  # Plot beeswarm
  plt <- ggplot(df, aes(x = result, y = expr, col = result, group = result)) +
    geom_quasirandom() +
    geom_boxplot(outlier.shape = NA, fill = NA, col = "black", width = 0.3) +
    ggtitle("Gene change versus gene expression") +
    xlab("pA-DamID change") +
    ylab("Gene expression (rlog2)") +
    scale_color_brewer(palette = "Set1") +
    theme_bw() +
    theme(aspect.ratio = 1)
  
  # plot(plt)
  
  
  # What are active genes?
  plt <- ggplot(df, aes(x = expr)) +
    geom_histogram(binwidth = 1) +
    geom_vline(xintercept = 7, col = "red", linetype = "dashed") +
    xlab("Gene expression (rlog)") +
    ylab("Count") +
    ggtitle("Gene expression histogram") +
    theme_bw() +
    theme(aspect.ratio = 1)
  
  # plot(plt)
  
  
  # Let's say rlog > 7
  cutoff <- 7
  
  # Gene density
  LADs$gene_density <- countOverlaps(LADs, rnaseq, type = "any") / width(LADs) * 1e6
  LADs$gene_density.active <- countOverlaps(LADs, rnaseq[mcols(rnaseq)[, cell] > cutoff], 
                                            type = "any") / width(LADs) * 1e6
  
  # Get the data frame
  df <- as(mcols(LADs), "data.frame")
  # df <- df[complete.cases(df), ]
  
  # Plot beeswarm
  plt <- ggplot(df, aes(x = G2_G1, y = gene_density, fill = G2_G1, group = G2_G1)) +
    geom_boxplot(outlier.shape = NA) +
    ggtitle("Gene change versus gene density") +
    xlab("pA-DamID change") +
    ylab("# genes / Mb") +
    scale_color_brewer(palette = "Set1", name = "result") +
    coord_cartesian(ylim = c(0, 18)) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  # plot(plt)
  
  plt <- ggplot(df, aes(x = G2_G1, y = gene_density.active, fill = G2_G1, group = G2_G1)) +
    geom_boxplot(outlier.shape = NA) +
    ggtitle("Gene change versus gene density") +
    xlab("pA-DamID change") +
    ylab("# active genes / Mb") +
    scale_color_brewer(palette = "Set1", name = "result") +
    coord_cartesian(ylim = c(0, 6)) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  # plot(plt)
  
  df$cell <- cell
  df.combined <- rbind(df.combined, 
                       df[, c("gene_density.active", "cell", "G2_G1")])
  
}
## [1] "RPE"

Variable.

7) Chromosomal positioning

Distance to telomeres / centromeres.

# Read chrom sizes
chrom_sizes <- read.table("/DATA/usr/t.v.schaik/data/genomes/GRCh38/hg38.chrom.sizes", sep = "\t")
row.names(chrom_sizes) <- chrom_sizes[, 1]

# Function to scale chromosome arms
ScaleChromosomeArms <- function(gr, chrom_sizes, centromeres.gr, inverse = FALSE) {
  # For a given "gr" GRanges object, calculate the relative distance to a
  # chromosome arm. 
  
  # Calculate distance to centromere for every gene
  gr$distance_to_centromere <- mcols(distanceToNearest(gr, centromeres.gr))$distance
  gr$distance_to_centromere <- as.numeric(gr$distance_to_centromere)
  
  # For each chromosome
  #   if start of gene is left of the centromere,
  #     normalize the distance by the first base - start centromere distance
  #   else
  #     normalize by the distance by the end centromere - last base
  for (chr in seqlevels(gr)) {
    centromeres.chr <- centromeres.gr[seqnames(centromeres.gr) == chr]
    left_arm <- start(centromeres.chr) - 1
    right_arm <- chrom_sizes[chr, 2] - end(centromeres.chr)
    
    idx <- seqnames(gr) %in% chr
    gr.tmp <- gr[idx]
    
    norm_dis <- ifelse(start(gr.tmp) < start(centromeres.chr),
                       gr.tmp$distance_to_centromere / left_arm,
                       gr.tmp$distance_to_centromere / right_arm)
    
    # If inverse, inverse the distance (meaning towards telomeres)
    if (inverse) {
      norm_dis <- 1 - norm_dis
    }
    
    mcols(gr[idx])[, "distance_to_centromere"] <- norm_dis
  }
  
  # Return normalized distances
  gr$distance_to_centromere
}

# Define telomeres & load centromeres
telomeres <- GRanges(seqnames = rep(seqlevels(LADs), times = 2),
                     ranges = IRanges(start = c(rep(1, length(seqlevels(LADs))),
                                                seqlengths(LADs)),
                                      end = c(rep(1, length(seqlevels(LADs))),
                                              seqlengths(LADs))))
centromeres <- import("ts190612_differential_analysis_BinsandLADs_RPE/centromeres.bed")

# Prep output df
df.combined <- c()

for (cell in cells) {
  
  LADs <- LADs.norm[[cell]]
  
  # Distance to telomeres data frame
  LADs$distance_to_telomeres <- mcols(distanceToNearest(LADs, telomeres))$distance / 1e6
  LADs$distance_to_centromeres <- mcols(distanceToNearest(LADs, centromeres))$distance / 1e6
  LADs$distance_to_telomeres_scaled <- ScaleChromosomeArms(LADs, chrom_sizes, 
                                                             centromeres, inverse = T)
  
  LADs$seqnames <- seqnames(LADs)
  
  # Add mean t_1h / t_21h
  mcols(LADs)[, "t_1h"] <- rowMeans(as(mcols(LADs), "data.frame")[, samples.df[[cell]]$name[samples.df[[cell]]$phase == "t_1h"]])
  mcols(LADs)[, "t_3h"] <- rowMeans(as(mcols(LADs), "data.frame")[, samples.df[[cell]]$name[samples.df[[cell]]$phase == "t_3h"]])
  mcols(LADs)[, "t_6h"] <- rowMeans(as(mcols(LADs), "data.frame")[, samples.df[[cell]]$name[samples.df[[cell]]$phase == "t_6h"]])
  mcols(LADs)[, "t_10h"] <- rowMeans(as(mcols(LADs), "data.frame")[, samples.df[[cell]]$name[samples.df[[cell]]$phase == "t_10h"]])
  mcols(LADs)[, "t_21h"] <- rowMeans(as(mcols(LADs), "data.frame")[, samples.df[[cell]]$name[samples.df[[cell]]$phase == "t_21h"]])
  mcols(LADs)[, "diff"] <- LADs$t_21h - LADs$t_1h
  
  # Add LAD size
  mcols(LADs)[, "size"] <- width(LADs) / 1e6
  
  df <- as(mcols(LADs), "data.frame")
  
  df$cell <- cell
  df.combined <- rbind(df.combined, 
                       df[, c("distance_to_telomeres", "distance_to_centromeres", 
                              "distance_to_telomeres_scaled", "t_1h", "t_3h", 
                              "t_6h", "t_10h", "t_21h", 
                              "diff", "size", "cell", "G2_G1", "seqnames")])
  
  
}

df.combined$cell <- factor(df.combined$cell, levels = c("RPE"))

# Order chromosomes by length
df.combined$seqnames <- factor(df.combined$seqnames,
                               levels = seqlevels(LADs)[order(seqlengths(LADs), 
                                                              decreasing = T)])

plt <- ggplot(df.combined, aes(x = cell, y = distance_to_telomeres, 
                               fill = G2_G1)) +
  geom_boxplot(outlier.shape = NA, position = "dodge") +
  ggtitle("Telomeres") +
  xlab("") +
  ylab("Distance to telomeres (Mb)") +
  scale_x_discrete(drop = FALSE) +
  scale_fill_manual(values = brewer.pal(9, "Set1")[c(9, 1, 3)],
                    name = "result") +
  #coord_cartesian(ylim = c(0, 8)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot(plt)

10) Linear modeling

I want to make the point that LAD size and distance to telomeres are mostly independent. Bas suggested linear modeling with an interaction term. Let’s have a look.

## 
## Call:
## lm(formula = slope ~ distance_to_telomeres + size + distance_to_telomeres * 
##     size, data = df.combined)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.076115 -0.010520  0.002142  0.012578  0.050634 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.123e-02  1.216e-03 -17.460  < 2e-16 ***
## distance_to_telomeres       3.167e-04  2.645e-05  11.972  < 2e-16 ***
## size                        4.000e-03  4.191e-04   9.544  < 2e-16 ***
## distance_to_telomeres:size -2.526e-05  8.647e-06  -2.921  0.00355 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01931 on 1172 degrees of freedom
## Multiple R-squared:  0.2107, Adjusted R-squared:  0.2087 
## F-statistic: 104.3 on 3 and 1172 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = G2_min_G1 ~ distance_to_telomeres + size + distance_to_telomeres * 
##     size, data = df.combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62306 -0.22360  0.05721  0.27485  1.10506 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -0.4770592  0.0264201 -18.057  < 2e-16 ***
## distance_to_telomeres       0.0076990  0.0005748  13.394  < 2e-16 ***
## size                        0.0876348  0.0091065   9.623  < 2e-16 ***
## distance_to_telomeres:size -0.0005745  0.0001879  -3.058  0.00228 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4195 on 1172 degrees of freedom
## Multiple R-squared:  0.2346, Adjusted R-squared:  0.2326 
## F-statistic: 119.7 on 3 and 1172 DF,  p-value: < 2.2e-16

## [1] 0.01478213
## [1] 0.01064637

Based on these extreme weak correlations and the relatively low interaction term, I conclude that these features are mostly independent.

Conclusion

Important message: there is no correlation between DNA characteristics and lamina cell cycle dynamics. The main feature really is telomere positioning.

SessionInfo

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.22           gtools_3.8.1         edgeR_3.26.0        
##  [4] limma_3.40.0         ggbeeswarm_0.6.0     RColorBrewer_1.1-2  
##  [7] reshape2_1.4.3       ggplot2_3.1.1        rtracklayer_1.44.0  
## [10] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0  IRanges_2.18.0      
## [13] S4Vectors_0.22.0     BiocGenerics_0.30.0 
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.44.0              splines_3.6.1              
##  [3] Formula_1.2-3               assertthat_0.2.1           
##  [5] latticeExtra_0.6-28         GenomeInfoDbData_1.2.1     
##  [7] vipor_0.4.5                 Rsamtools_2.0.0            
##  [9] yaml_2.2.0                  pillar_1.3.1               
## [11] backports_1.1.4             lattice_0.20-38            
## [13] glue_1.3.1                  digest_0.6.18              
## [15] XVector_0.24.0              checkmate_1.9.3            
## [17] colorspace_1.4-1            htmltools_0.3.6            
## [19] Matrix_1.2-17               plyr_1.8.4                 
## [21] XML_3.98-1.19               pkgconfig_2.0.2            
## [23] zlibbioc_1.30.0             purrr_0.3.2                
## [25] scales_1.0.0                BiocParallel_1.18.0        
## [27] tibble_2.1.1                htmlTable_1.13.1           
## [29] withr_2.1.2                 SummarizedExperiment_1.14.0
## [31] nnet_7.3-12                 lazyeval_0.2.2             
## [33] survival_2.44-1.1           magrittr_1.5               
## [35] crayon_1.3.4                evaluate_0.13              
## [37] foreign_0.8-71              beeswarm_0.2.3             
## [39] data.table_1.12.2           tools_3.6.1                
## [41] matrixStats_0.54.0          stringr_1.4.0              
## [43] munsell_0.5.0               locfit_1.5-9.1             
## [45] cluster_2.0.9               DelayedArray_0.10.0        
## [47] Biostrings_2.52.0           compiler_3.6.1             
## [49] rlang_0.3.4                 grid_3.6.1                 
## [51] RCurl_1.95-4.12             rstudioapi_0.10            
## [53] htmlwidgets_1.3             labeling_0.3               
## [55] bitops_1.0-6                base64enc_0.1-3            
## [57] rmarkdown_1.12              gtable_0.3.0               
## [59] R6_2.4.0                    GenomicAlignments_1.20.0   
## [61] gridExtra_2.3               dplyr_0.8.0.1              
## [63] Hmisc_4.2-0                 stringi_1.4.3              
## [65] Rcpp_1.0.1                  rpart_4.1-15               
## [67] acepack_1.4.1               tidyselect_0.2.5           
## [69] xfun_0.6